PetFinder Adoption Final Report

Introduction

In 2016 alone, more than 2 million animals ended up in animal shelters and were not adopted. Shelters around the world work constantly to lower this number and place deserving animals in loving homes. PetFinder, an animal adoption website, assists with this goal by allowing animal profiles to be put online for people to find. Getting a profile onto PetFinder is the first step; making an effective profile is more difficult. A number of questions need to be answered about the profile: how many photos and videos to include, whether to list the vaccination status of the animal, and so on. In our last study we treated the adoption speed of an animal as a numerical variable and were not able to predict it successfully from the descriptors of the profile. Given this, we decided to ask more complex questions about this dataset to see whether we could find patterns in adoption profiles that could give insight to adoption centers as they work to get animals adopted.

Preprocessing

We are using the same dataset as in the midterm assignment. The data comes from Kaggle (https://www.kaggle.com/c/petfinder-adoption-prediction/data) and includes almost 15,000 observations of 23 variables. Each observation is an adoption profile on PetFinder in Malaysia. We kept 9 features and discarded any profile that contained more than one animal, as well as any animal that was not adopted in the first 100 days. After these cleaning steps we are left with 8485 observations of the 9 features. Initially we transformed the categorical dependent variable (adoption speed) into a continuous numerical variable. For this project we keep it as a categorical variable for some of our analyses.

The variables in our dataset were all imported as numerical values. To stay faithful to the data we needed to convert certain features to different data types. The Type of animal (1 = Dog, 2 = Cat) was converted to a factor, along with MaturitySize (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified), FurLength (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified), Vaccinated (1 = Yes, 2 = No, 3 = Not Sure), and Gender (1 = Male, 2 = Female). Finally, AdoptionSpeed was made into an ordinal variable with the levels 0 (adopted the first day), 1 (adopted the first week), 2 (adopted the first month), and 3 (adopted in the first 3 months). Age, PhotoAmt, and VideoAmt all remained numeric.
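The type conversion described above can be sketched as follows. This is a minimal illustration, not the code used for the report: the data frame name `pets` and its four rows are invented, and only the column names come from the dataset. The numeric codes are kept as factor levels, which matches coefficient names such as `Type2` in the model output later in the report.

```r
# Toy stand-in for the imported data (values invented for illustration).
pets <- data.frame(
  Type = c(1, 2, 2, 1), Gender = c(1, 2, 2, 1), MaturitySize = c(1, 2, 3, 2),
  FurLength = c(1, 2, 3, 1), Vaccinated = c(1, 2, 3, 2),
  AdoptionSpeed = c(0, 3, 1, 2), Age = c(3, 12, 1, 24)
)

# Nominal features become unordered factors; the numeric codes stay as levels.
nominal <- c("Type", "Gender", "MaturitySize", "FurLength", "Vaccinated")
pets[nominal] <- lapply(pets[nominal], factor)

# AdoptionSpeed becomes an ordered factor with levels 0 < 1 < 2 < 3.
pets$AdoptionSpeed <- factor(pets$AdoptionSpeed, levels = 0:3, ordered = TRUE)

str(pets)
```

Using an ordered factor is what makes R fit polynomial contrasts for AdoptionSpeed when it is used as a predictor.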

The next preprocessing step was to split the data into training and test sets in preparation for modeling. We used a 70/30 split, with the final training set containing 5941 observations and the test set containing 2544 observations.
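A simple random 70/30 split might look like the sketch below. The splitting function actually used for the report is not shown, so this is only an assumption; a plain random split of 8485 rows gives counts slightly different from the 5941/2544 reported above. The seed and the `pets` placeholder are arbitrary.

```r
set.seed(42)                         # arbitrary seed so the sketch is reproducible
n <- 8485                            # number of cleaned profiles
pets <- data.frame(id = seq_len(n))  # placeholder for the cleaned data

# Draw 70% of the row indices for training; the rest form the test set.
train_idx <- sample(n, size = floor(0.7 * n))
train <- pets[train_idx, , drop = FALSE]
test  <- pets[-train_idx, , drop = FALSE]

c(train = nrow(train), test = nrow(test))
```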

Background EDA

As a refresher on the basic features of the dataset we can look at the distribution of the various variables.

First, here is our dependent variable Adoption Speed, plotted by number of occurrences since it is categorical.

We can also look at the other numerical features.

Next we can see the distributions of the categorical variables.

The first is Type. There are two types: Dog and Cat.
The second is Gender. There are two genders: Male and Female.
The third is MaturitySize. There are four levels, namely Small, Medium, Large, and Extra Large.
The fourth is FurLength. There are three levels, namely Short, Medium, and Long.
The fifth is Vaccinated. There are three levels, namely Vaccinated, Unvaccinated, and Not Sure.

After this we did some EDA to see which features affected adoption speed. Because the dependent variable was originally categorical, we converted it to a continuous numerical feature. We then ran t-tests and ANOVAs on a number of the features against the now numerical adoption speed. We discovered that, if we disregarded the assumption of equal variance, every feature tested had a significant effect on adoption speed.

## 
##  Welch Two Sample t-test
## 
## data:  dogs$ASnum and cats$ASnum
## t = 9, df = 8211, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3.91 6.06
## sample estimates:
## mean of x mean of y 
##      28.6      23.6
## 
##  Welch Two Sample t-test
## 
## data:  male$ASnum and female$ASnum
## t = -5, df = 8268, p-value = 5e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.87 -1.70
## sample estimates:
## mean of x mean of y 
##      24.9      27.6
## Analysis of Variance Table
## 
## Response: ASnum
##                Df  Sum Sq Mean Sq F value  Pr(>F)    
## MaturitySize    3   46482   15494    24.2 1.3e-15 ***
## Residuals    8481 5426228     640                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: ASnum
##             Df  Sum Sq Mean Sq F value  Pr(>F)    
## FurLength    2   24112   12056    18.8 7.4e-09 ***
## Residuals 8482 5448597     642                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: ASnum
##              Df  Sum Sq Mean Sq F value Pr(>F)    
## Vaccinated    2   58767   29383      46 <2e-16 ***
## Residuals  8482 5413943     638                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
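Tests of this kind can be run in base R as sketched below. The data are synthetic and the data frame `d` is hypothetical; only the variable names `ASnum`, `Type`, and `Vaccinated` follow the output above. `t.test()` applies the Welch correction by default, and `oneway.test()` is shown as one option for a multi-level comparison that does not assume equal variances.

```r
set.seed(1)
# Synthetic stand-in: ASnum plays the role of the numeric adoption speed.
d <- data.frame(
  ASnum      = c(rnorm(60, 28, 25), rnorm(60, 24, 20)),
  Type       = factor(rep(c(1, 2), each = 60)),
  Vaccinated = factor(sample(1:3, 120, replace = TRUE))
)

# Two-level feature: Welch two-sample t-test (var.equal = FALSE is the default).
welch <- t.test(ASnum ~ Type, data = d)

# Multi-level feature: classical one-way ANOVA table.
aov_tab <- anova(lm(ASnum ~ Vaccinated, data = d))

# Welch-type one-way test that does not assume equal variances.
welch_aov <- oneway.test(ASnum ~ Vaccinated, data = d, var.equal = FALSE)

welch$method
```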

SMART QUESTION: What is the probability that a large, female, vaccinated dog will be adopted quickly (adoption speed of 0 or 1)?

Objective: It has always been hard to know when pets will be adopted. If an adoption organization knew, it could prepare pet food and other resources accordingly instead of hoping that a pet will be adopted some day. For example, if the organization knows that a large, female, vaccinated dog has a high chance of being adopted quickly (within a week), it can save much of the funding it would otherwise set aside for this pet and use those resources to keep other pets alive. This is the goal of this SMART question: what is the probability that a large, female, vaccinated dog will be adopted quickly (adoption speed of 0 or 1)?

Before the data processing, there are 331 pets at adoption speed 0, 2439 at adoption speed 1, 3163 at adoption speed 2, 2552 at adoption speed 3, and none at adoption speed 4.
Type     Gender   MaturitySize Vaccinated AdoptionSpeed
1:4754   1:3837   1:1890       1:3498     0: 331
2:3731   2:4648   2:5816       2:3939     1:2439
         3:   0   3: 751       3:1048     2:3163
                  4:  28                  3:2552
                                          4:   0

To answer this question, the first step is to merge adoption speeds 0 and 1. After merging, 2770 pets are in level 0.

##  Type     Gender   MaturitySize Vaccinated AdoptionSpeed
##  1:4754   1:3837   1:1890       1:3498     0:2770       
##  2:3731   2:4648   2:5816       2:3939     2:3163       
##           3:   0   3: 751       3:1048     3:2552       
##                    4:  28                  4:   0

The next step is to merge adoption speeds 3 and 4. Because no pets have adoption speed 4, level 3 still contains 2552 pets.

##  Type     Gender   MaturitySize Vaccinated AdoptionSpeed
##  1:4754   1:3837   1:1890       1:3498     0:2770       
##  2:3731   2:4648   2:5816       2:3939     2:3163       
##           3:   0   3: 751       3:1048     3:2552       
##                    4:  28

Then we merge adoption speeds 2 and 3. After merging, 5715 pets are in level 2.

Type     Gender   MaturitySize Vaccinated AdoptionSpeed
1:4754   1:3837   1:1890       1:3498     0:2770
2:3731   2:4648   2:5816       2:3939     2:5715
         3:   0   3: 751       3:1048
                  4:  28

Lastly, we relabel these two levels as 0 and 1. Level 0 now contains the 2770 pets with an original adoption speed of 0 or 1, meaning they were adopted within one week, and level 1 contains the 5715 pets that took more than one week to be adopted.

##  Type     Gender   MaturitySize Vaccinated AdoptionSpeed
##  1:4754   1:3837   1:1890       1:3498     0:2770       
##  2:3731   2:4648   2:5816       2:3939     1:5715       
##           3:   0   3: 751       3:1048                  
##                    4:  28
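The regrouping steps above amount to a single binary recode, sketched here on a toy vector. The name `speed` and its six values are invented, and the report's own code performed the merge in several steps rather than one. On the real data the same recode gives 331 + 2439 = 2770 profiles in level 0 and 3163 + 2552 = 5715 in level 1.

```r
# Toy vector of original AdoptionSpeed codes (values invented for illustration).
speed <- factor(c(0, 1, 2, 3, 1, 2), levels = 0:4)

# Codes 0 and 1 (adopted within the first week) become 0; slower codes become 1.
code <- as.integer(as.character(speed))
slow <- factor(as.integer(code >= 2), levels = 0:1)

table(slow)
```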

Logistic Model

## 
## Call:
## glm(formula = AdoptionSpeed ~ Type + MaturitySize + Vaccinated + 
##     Gender, family = "binomial", data = new_dat)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.737  -1.387   0.796   0.913   1.196  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     0.6371     0.0680    9.37  < 2e-16 ***
## Type2          -0.2726     0.0494   -5.52  3.5e-08 ***
## MaturitySize2   0.4432     0.0564    7.85  4.1e-15 ***
## MaturitySize3  -0.0812     0.0897   -0.91  0.36514    
## MaturitySize4   0.1526     0.4104    0.37  0.71010    
## Vaccinated2    -0.3267     0.0523   -6.25  4.2e-10 ***
## Vaccinated3    -0.1363     0.0770   -1.77  0.07664 .  
## Gender2         0.1777     0.0472    3.77  0.00016 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10719  on 8484  degrees of freedom
## Residual deviance: 10505  on 8477  degrees of freedom
## AIC: 10521
## 
## Number of Fisher Scoring iterations: 4

From this logistic model, we can see that most of the coefficients are statistically significant; the exceptions are MaturitySize3, MaturitySize4, and Vaccinated3. We next evaluate the logistic model to see whether it is accurate.

Model Evaluation

In this part, we use a confusion matrix, the Hosmer and Lemeshow goodness of fit (GOF) test, and the ROC curve to examine the model accuracy.

Confusion Matrix

Confusion matrix from Logit Model
          Predicted 0  Predicted 1  Total
Actual 0           32         2738   2770
Actual 1           55         5660   5715
Total              87         8398   8485

Accuracy: 0.6708

GOF Test

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  new_dat$AdoptionSpeed, fitted(logit)
## X-squared = 8485, df = 8, p-value <2e-16

The p-value is below 2e-16, so the test rejects the null hypothesis that the model fits the data well.

ROC Curve

The area under the ROC curve is 0.599, which is below 0.8 and only slightly above the 0.5 expected from chance, suggesting that predictions from this model are not accurate.

##   Type MaturitySize Gender Vaccinated   fit se.fit residual.scale    UL    LL
## 1    1            3      2          1 0.734 0.0847              1 0.711 0.638
##   PredictedProb
## 1         0.676
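The prediction above can be checked by hand from the coefficient table of the logistic model. The sketch below assumes, as the recoding step states, that level 1 of AdoptionSpeed means adoption after more than one week; a binomial `glm` with a factor response models the probability of the second level, so the probability of quick adoption is the complement of the predicted value. The variable names are ours.

```r
# Coefficients copied from the logistic model summary above.
b <- c(intercept = 0.6371, MaturitySize3 = -0.0812, Gender2 = 0.1777)

# Type 1 (dog) and Vaccinated 1 (yes) are reference levels, so they add nothing.
eta     <- sum(b)       # linear predictor on the log-odds scale
p_slow  <- plogis(eta)  # P(AdoptionSpeed = 1): adopted after the first week
p_quick <- 1 - p_slow   # P(adopted within the first week)

round(c(eta = eta, p_slow = p_slow, p_quick = p_quick), 3)
```

Under that coding, the `fit` column (0.734) is the log-odds and `PredictedProb` (0.676) is the probability of slow adoption, which leaves roughly 0.324 as the estimated probability that a large, female, vaccinated dog is adopted within a week.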

SMART QUESTION: Does the animal profile influence the adoption speed significantly, and what is the best model (Logistic Regression, KNN) when considering these variables?

We research this question because, although all pets are cute, some pets are still less popular than others. We build this model to see what influences AdoptionSpeed and how it could be improved, for example by having the adoption website run activities that show adopters the appeal of the relatively less popular pets.

Firstly, we do some feature selection.

Feature Selection

Logistic Regression

Logistic Regression : AdoptionSpeed ~ age+type+gender+size+fur+vaccinated+photos
               Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)      0.4690      0.0828    5.662    0.0000
Age              0.0048      0.0015    3.181    0.0015
Type2           -0.2858      0.0505   -5.653    0.0000
Gender2          0.1918      0.0476    4.029    0.0001
MaturitySize2    0.4552      0.0575    7.917    0.0000
MaturitySize3   -0.0151      0.0916   -0.165    0.8687
MaturitySize4    0.2450      0.4152    0.590    0.5550
FurLength2      -0.2822      0.0504   -5.598    0.0000
FurLength3      -0.6886      0.0932   -7.386    0.0000
Vaccinated2     -0.3074      0.0550   -5.585    0.0000
Vaccinated3     -0.0844      0.0777   -1.086    0.2774
PhotoAmt         0.0666      0.0085    7.836    0.0000

VideoAmt was not statistically significant in the glm fit, so we removed this variable from the model. Let us do some model evaluation to see whether the model is good.

Confusion matrix from Logit Model
          Predicted 0  Predicted 1  Total
Actual 0          240         2530   2770
Actual 1          207         5508   5715
Total             447         8038   8485

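Accuracy = 0.677.
Precision = 0.685.
Recall rate = 0.964.
Specificity = 0.087.
F1 score = 0.801.
The accuracy is 0.677, but always predicting class 1 would already give 5715/8485 = 0.674, and the specificity of 0.087 (240 of 2770) shows that the model almost never identifies quickly adopted pets. The accuracy alone therefore overstates the quality of the model.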
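The metrics can be recomputed directly from the counts in the confusion matrix. The sketch below treats class 1 as the positive class, which is the convention that reproduces the reported precision and recall. Specificity is computed as TN / (TN + FP), and a majority-class baseline is added for comparison. The object names are ours.

```r
# Counts copied from the confusion matrix above (rows = actual, columns = predicted).
cm <- matrix(c(240, 2530,
               207, 5508),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

tn <- cm["0", "0"]; fp <- cm["0", "1"]
fn <- cm["1", "0"]; tp <- cm["1", "1"]

metrics <- c(
  accuracy    = (tp + tn) / sum(cm),
  precision   = tp / (tp + fp),
  recall      = tp / (tp + fn),
  specificity = tn / (tn + fp),
  f1          = 2 * tp / (2 * tp + fp + fn),
  baseline    = max(rowSums(cm)) / sum(cm)  # always predicting the majority class
)
round(metrics, 3)
```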

Hosmer and Lemeshow test
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  dataTGMFV$AdoptionSpeed, fitted(petFeaturetableAll)
## X-squared = 8485, df = 8, p-value <2e-16

The p-value is below 2e-16, which is very small. A small p-value in this test means that the model is a poor fit. The next evaluation is AUC/ROC.

AUC/ROC
## Area under the curve: 0.621

The area under the ROC curve is 0.621, below the 0.8 we look for, so this is not a very good model. After the AUC/ROC, let us try the McFadden pseudo \(R^2\).

McFadden
##       llh   llhNull        G2  McFadden      r2ML      r2CU 
## -5.18e+03 -5.36e+03  3.56e+02  3.32e-02  4.10e-02  5.72e-02

The McFadden value is 0.033. Read as an analogue of the coefficient of determination \(R^2\), it indicates that only about 3.3% of the variation in y is explained by the explanatory variables in the model.
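The McFadden value follows from the two log-likelihoods in the output as one minus their ratio. The sketch below uses the log-likelihoods as printed, which are rounded to three significant digits, so the results differ slightly from the reported 0.0332 and 356. The variable names are ours.

```r
# Log-likelihoods as printed above (rounded to three significant digits).
llh      <- -5.18e3  # fitted model
llh_null <- -5.36e3  # intercept-only model

mcfadden <- 1 - llh / llh_null      # McFadden pseudo R-squared
g2       <- -2 * (llh_null - llh)   # likelihood-ratio statistic (G2 in the output)

round(c(McFadden = mcfadden, G2 = g2), 4)
```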

K Nearest Neighbor

 k  Total.Accuracy
 3           0.620
 5           0.639
 7           0.641
 9           0.644
11           0.651
13           0.650
15           0.655
17           0.663
19           0.661
21           0.655

The best k is 17, with a total accuracy of 0.663, so let us look at the confusion matrix for that model.

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2802 
## 
##  
##                | pet_pred 
## pet.testLabels |         0 |         1 | Row Total | 
## ---------------|-----------|-----------|-----------|
##              0 |       172 |       719 |       891 | 
##                |     0.193 |     0.807 |     0.318 | 
##                |     0.449 |     0.297 |           | 
##                |     0.061 |     0.257 |           | 
## ---------------|-----------|-----------|-----------|
##              1 |       211 |      1700 |      1911 | 
##                |     0.110 |     0.890 |     0.682 | 
##                |     0.551 |     0.703 |           | 
##                |     0.075 |     0.607 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |       383 |      2419 |      2802 | 
##                |     0.137 |     0.863 |           | 
## ---------------|-----------|-----------|-----------|
## 
## 

Accuracy = 0.668.
Precision = 0.449.
Recall rate = 0.193.
Specificity = 0.89.
F1 score = 0.27.
The accuracy is 0.668, slightly below the 0.682 obtained by always predicting class 1 (1911 of 2802), so this model is also not good enough.

As a result, we need more variables and data to predict AdoptionSpeed.

SMART QUESTION: Can the type of animal be classified from the adoption profile?

The next question we wanted to answer shifted our attention to the type of animal in the profile. We had determined that we could not predict the adoption speed effectively with classification models, so we took a different approach to finding patterns in the adoption profiles. If a classification model is able to predict the type of animal from the adoption profile, it suggests a pattern either in the animals themselves or in the way the profiles are written. Either possibility leads to more insight about the profiles and to possible interventions that help as many animals as possible be adopted into loving homes.

Because this is a binary classification problem we decided to use a logistic regression model and a K nearest neighbors model. Beginning with the logistic model, we ran feature selection to determine what features we should include.

The results of the feature selection are as follows.

##    Age Gender MaturitySize FurLength Vaccinated PhotoAmt VideoAmt AdoptionSpeed
## 1 TRUE   TRUE         TRUE      TRUE       TRUE     TRUE    FALSE          TRUE

The features that were selected using the AIC criteria and the exhaustive method are every feature except for VideoAmt. We can now make that model and test it against the animal type. The results of this model are as follows.

Logistic Regression : Type ~ age+gender+size+fur+vaccinated+AdoptSpd+photos
                 Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)        0.2594      0.0822    3.154    0.0016
Age               -0.0179      0.0018  -10.212    0.0000
Gender2           -0.1801      0.0473   -3.808    0.0001
MaturitySize2     -0.9701      0.0589  -16.464    0.0000
MaturitySize3     -0.7066      0.0950   -7.440    0.0000
MaturitySize4     -0.9954      0.4609   -2.160    0.0308
FurLength2        -0.1634      0.0501   -3.257    0.0011
FurLength3         0.5550      0.0978    5.677    0.0000
Vaccinated2        0.8690      0.0528   16.446    0.0000
Vaccinated3       -0.1552      0.0794   -1.955    0.0506
AdoptionSpeed.L   -0.4891      0.0867   -5.640    0.0000
AdoptionSpeed.Q    0.0333      0.0698    0.477    0.6336
AdoptionSpeed.C   -0.0480      0.0479   -1.002    0.3165
PhotoAmt           0.0566      0.0074    7.620    0.0000

This model has multiple statistically significant coefficients. Every coefficient except Vaccinated level 3 and the quadratic and cubic AdoptionSpeed contrasts has a p-value below our alpha of 0.05. This suggests that almost every factor significantly affects the model's decision about the type of animal the profile represents. We can then look at different metrics to determine whether this model was effective. First we look at the accuracy metrics of the model. The confusion matrix and metrics for this model are as follows.

Type Confusion matrix from Logit Model
          Predicted 1  Predicted 2  Total
Actual 1         3512         1242   4754
Actual 2         1694         2037   3731
Total            5206         3279   8485

Accuracy: 0.654
Precision: 0.675
Recall: 0.739
F1: 0.705

These numbers aren’t terrible, but let’s look at other metrics as well.

We can look at the Hosmer and Lemeshow goodness of fit test next.

H0: The model is a good fit for our dependent variable
H1: The model is a bad fit for our dependent variable

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  datclean$y, fitted(typeLogit)
## X-squared = 8485, df = 8, p-value <2e-16

With a p-value below 2e-16 we reject the null hypothesis that this model is a good fit for our dependent variable of animal type. To confirm this finding we looked at the receiver operating characteristic (ROC) curve of this model. This plot shows the true positive rate against the false positive rate. Plotting the model in this way shows how far above chance the model performs.

We can see by looking at this plot that the model is not very effective. While it is above chance, we ideally want to see a more defined box-shape of the curve which would suggest that it is able to accurately predict the type of animal. The area under this curve confirms that the model is not great. With our AUC = 0.702, this is below the threshold of 0.8 that we are looking for to consider a model an effective classifier.
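To make the AUC concrete: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, so it can be computed from ranks without drawing the curve. The sketch below is a base R illustration on four invented scores, with labels coded 0/1. The function `auc_rank` is our own helper, not the routine used to produce the value in the report.

```r
# AUC via the rank-sum (Mann-Whitney) identity.
auc_rank <- function(score, label) {
  pos   <- label == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(score)  # average ranks handle tied scores
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Tiny invented example: four predicted probabilities and their true classes.
auc_rank(c(0.10, 0.40, 0.35, 0.80), c(0, 0, 1, 1))
```

A value of 0.5 corresponds to the diagonal (chance) and 1 to a perfect ranking.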

The final metric we can use to confirm the state of our logistic model is the McFadden score. This is a proxy for an \(R^2\) value.

## fitting null model for pseudo-r2

With a McFadden score of 0.0972 we can say that only 9.72% of the variance in Type is explained by this model.

After looking at all these metrics we have confirmed that this model is not adept at classifying type based on the adoption profile of the animal. Given this, we decided to move on to the KNN model in the hopes that this model would be more effective.

The first step to use this model is cleaning the dataset. First, the numeric variables Age, Photo Amount and Video Amount are scaled to center them. Then all the factor level variables are returned to numeric variables for the model.

In order to determine the optimal K value for the KNN model we ran through values 3-20 and chose the K with the highest total accuracy.
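The scaling step and the search over K described above could be implemented along the lines below. This is a sketch on synthetic data, assuming the `class` package (a recommended package normally installed with R) and its `knn()` function. The data frame `x`, the simulated `type` labels, and the seed are all invented, so the K it selects says nothing about the report's results.

```r
library(class)  # provides knn()

set.seed(7)
# Synthetic stand-in for the profile data (all values invented).
n <- 300
x <- data.frame(Age = rexp(n, 1 / 12), PhotoAmt = rpois(n, 4),
                Vaccinated = sample(1:3, n, replace = TRUE))
type <- factor(ifelse(x$Age + rnorm(n, sd = 10) > 12, 1, 2))

# Centre and scale the numeric features; factor codes stay numeric.
x$Age      <- as.numeric(scale(x$Age))
x$PhotoAmt <- as.numeric(scale(x$PhotoAmt))

train_idx <- sample(n, floor(0.7 * n))
ks <- 3:20

# Total accuracy on the held-out rows for each candidate K.
acc <- sapply(ks, function(k) {
  pred <- knn(train = x[train_idx, ], test = x[-train_idx, ],
              cl = type[train_idx], k = k)
  mean(pred == type[-train_idx])
})

best_k <- ks[which.max(acc)]
data.frame(k = ks, Total.Accuracy = round(acc, 3))
```

Note that `which.max()` returns the smallest K in case of a tie, which is one possible tie-breaking choice.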

The K value that was determined to be the best for this model was k = 18 with a total accuracy of 0.68.

To make sure that we’re getting the best possible model we tried one more KNN model before looking closer at the results. We removed Video Amount from the feature list since it was removed in the logistic regression model. We did the same run through of K values from 3-20.

The K value that was determined to be the best for this model was k = 20 with a total accuracy of 0.683. This accuracy is very slightly better than the previous model, so we will continue with this model going forward.

Taking a closer look at this second model we can pull out a number of statistics.

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2544 
## 
##  
##              | tpred 
##    testLabel |         1 |         2 | Row Total | 
## -------------|-----------|-----------|-----------|
##            1 |      1173 |       279 |      1452 | 
##              |     0.808 |     0.192 |     0.571 | 
##              |     0.691 |     0.330 |           | 
##              |     0.461 |     0.110 |           | 
## -------------|-----------|-----------|-----------|
##            2 |       525 |       567 |      1092 | 
##              |     0.481 |     0.519 |     0.429 | 
##              |     0.309 |     0.670 |           | 
##              |     0.206 |     0.223 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      1698 |       846 |      2544 | 
##              |     0.667 |     0.333 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

Accuracy: 0.684
Precision: 0.691
Recall: 0.808
F1: 0.745

All of these numbers are higher than those of our logistic regression model, so we can confirm that this model is slightly better at decoding the type of animal from the adoption profile. While the model is better than chance, it is still not very effective. This suggests that while there might be some discernible pattern to the profiles of the dogs and cats put on PetFinder, it is not strong enough to make the model highly accurate.

That being said, future studies should look into this pattern to see what the differences are specifically and whether or not they have an effect on adoption speed.

SMART Question: Can puppies/kittens be identified based on their adoption profile?

Objective

A common problem at animal shelters is the difficulty of accurately determining the age of a pet. Many families and prospective adopters have specific preferences for younger or older animals. Determining whether the other data can accurately predict whether or not an animal is very young seems like a reasonable first step toward finding out whether age can be predicted from this data at all. As 55.05 percent of the pets are young, the outcome is relatively balanced, which is a useful starting point for any attempt to predict age.

To answer the SMART question, let us consider the general problem. There is a categorical dependent variable, and a mixture of numerical and categorical independent variables. A Logistic Regression makes a great deal of sense in this situation. A classification-tree model is an option if left unsatisfied with the logistic model’s outcome.

Logistic Model

  • Logistic Model Feature Selection
##   Type Gender MaturitySize FurLength Vaccinated PhotoAmt VideoAmt AdoptionSpeed
## 1 TRUE   TRUE         TRUE      TRUE       TRUE     TRUE     TRUE          TRUE

We start with a full model, excluding Age because it would be collinear with the puppy/kitten label. Feature selection indicates that all variables should be kept.

Logistic Regression : Puppy ~ All variables
                 Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)       -0.6108      0.0833   -7.331    0.0000
Type2             -0.5404      0.0543   -9.945    0.0000
Gender2            0.1196      0.0497    2.408    0.0160
MaturitySize2      0.2332      0.0605    3.855    0.0001
MaturitySize3     -0.8213      0.1030   -7.971    0.0000
MaturitySize4     -1.3317      0.5609   -2.374    0.0176
FurLength2        -0.1142      0.0520   -2.196    0.0281
FurLength3        -1.2641      0.1100  -11.493    0.0000
Vaccinated2        1.8752      0.0567   33.101    0.0000
Vaccinated3        0.2664      0.0752    3.542    0.0004
PhotoAmt           0.0382      0.0081    4.717    0.0000
VideoAmt           0.1500      0.0794    1.889    0.0590
AdoptionSpeed.L   -0.1447      0.0897   -1.612    0.1069
AdoptionSpeed.Q   -0.3056      0.0722   -4.231    0.0000
AdoptionSpeed.C    0.0230      0.0507    0.454    0.6502

In the coefficient table for the recommended logistic model, the linear and cubic AdoptionSpeed contrasts and VideoAmt (p = 0.059) are not significant at the 0.05 level; all other coefficients are significant. Notice how large the z-value is for Vaccinated2. This standout returns in the classification tree fitted later.

Logistic Model Evaluation

Confusion Matrix for Logistic Model

Confusion matrix from Logit Model
          Predicted 0  Predicted 1  Total
Actual 0         2730         1084   3814
Actual 1         1493         3178   4671
Total            4223         4262   8485

Accuracy: 0.696
Precision: 0.646
Recall: 0.716
F1: 0.679

This is not the worst confusion matrix. The model over-predicts older animals and under-predicts puppies and kittens. Additional tests are needed to determine model quality.

Hosmer and Lemeshow goodness of fit (GOF) test

H0 = Model predictions are not statistically different from actual values. H1 = Model predictions are statistically different from actual values.

The p-value of the test is effectively 0. The null hypothesis is rejected in favor of the alternative hypothesis. According to this test the generated model is not a good fit. Based on this finding, as well as the values from the confusion matrix, confidence in the model is declining.

Receiver Operating Characteristic Curve

The shape is decent: it is closer to a parabola and does not simply follow the diagonal. However, the AUC of 0.702 is below the recommended 0.8 threshold, although it indicates a better fit than the Hosmer-Lemeshow test would suggest.

The McFadden score of 10% is pretty low. As the McFadden score is essentially a pseudo \(R^2\) value for logistic models, only 10% of the variation in the data is explained by the model.

With an accuracy of 0.696, a GOF test indicating poor fit, an AUC of 0.702 < 0.8, and a fairly low pseudo \(R^2\) value, it is safe to conclude that the dataset can only very weakly predict the age of an animal using a logistic model.

Let’s see if a classification tree can give a better score. It is possible the logistic model is just inappropriate for this dataset.

Classification Tree

We fit the tree using all available variables against the dependent variable, with the maximum depth set to 20.

Classification Tree Model Evaluation

## 
## Classification tree:
## rpart(formula = y ~ Type + Gender + MaturitySize + FurLength + 
##     Vaccinated + PhotoAmt + VideoAmt + AdoptionSpeed, data = datage, 
##     method = "class", control = list(maxdepth = 20))
## 
## Variables actually used in tree construction:
## [1] Vaccinated
## 
## Root node error: 3814/8485 = 0.4
## 
## n= 8485 
## 
##     CP nsplit rel error xerror xstd
## 1 0.32      0       1.0    1.0 0.01
## 2 0.01      1       0.7    0.7 0.01

The above summary gives the surprising conclusion that only Vaccinated is used in the classification tree. Considering that Vaccinated is a categorical variable with three levels, there cannot be more than three leaves, with 2 splits at most.

Confusion Matrix for Classification Tree

Confusion matrix (rows = predicted, columns = actual)
             Actual 0  Actual 1
Predicted 0      2883      1663
Predicted 1       931      3008

Accuracy = 69.43%
Precision = 63.42%
Recall = 75.59%
F1 = 68.97%

The recall rate of the classification tree is higher, and the precision rate lower, than the logistic model, but they have similar levels of accuracy. It is surprising that Vaccinated alone is able to generate similar levels of accuracy as the logistic model which incorporates every possible variable.

Tree Plot

The left branch is non-puppy/kitten animals. The right branch is puppy/kitten. The model goes along the left branch if vaccination status is 1 or 3, which indicates that the animal is already vaccinated or that its vaccination status is unknown, respectively. It goes right for vaccination status 2, which indicates that the animal is not vaccinated. The model is saying that unvaccinated animals are puppies/kittens.

Conclusion for SMART Question 3:

Both the logistic and the classification tree models generated similar outcomes when judging by accuracy. The conclusion which must be drawn is the other data in this set is largely incapable of predicting whether an animal is a puppy/kitten or an older more mature animal.

Conclusion of Paper

  • Targeted Prediction: Logistic Model, Accuracy .671

  • Adoption Speed Prediction: Logistic Model, Accuracy .677, McFadden = .033

  • Pet Type Prediction: Logistic Model, Accuracy .654, McFadden = .097

  • Age Type Prediction: Logistic Model, Accuracy .696, McFadden = .163

Bibliography

ASPCA. (2019). Pet statistics. Retrieved November 3, 2021, from https://www.aspca.org/helping-people-pets/shelter-intake-and-surrender/pet-statistics.

Babej, M. E. (2011, May 23). Petfinder.com arranges 17 million adoptions by open branding, technology. Forbes. Retrieved November 7, 2021, from https://www.forbes.com/sites/marcbabej/2011/05/10/petfinder-com-arranges-17-million-adoptions-by-open-branding-technology/?sh=184e0d8fac4b.